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AN MDL APPROACH TO THE CLIMATE SEGMENTATION 

PROBLEM 1 

By QiQi Lu, Robert Lund and Thomas C. M. Lee 

Mississippi State University, Clemson University, and Colorado State 
University and Chinese University of Hong Kong 

This paper proposes an information theory approach to estimate 
the number of changepoints and their locations in a climatic time se- 
ries. A model is introduced that has an unknown number of change- 
points and allows for series autocorrelations, periodic dynamics, and 
a mean shift at each changepoint time. An objective function gauging 
the number of changepoints and their locations, based on a minimum 
description length (MDL) information criterion, is derived. A genetic 
algorithm is then developed to optimize the objective function. The 
methods are applied in the analysis of a century of monthly temper- 
atures from Tuscaloosa, Alabama. 



1. Introduction. Changes in station instrumentation, location, or ob- 
server can often induce artificial discontinuities into climatic time series. 
For example, United States temperature recording stations average about 
six station relocation and instrumentation changes over a century of oper- 
ation [Mitchell (1953)]. Many of these changepoint times are documented 
in station histories; however, other changepoint times are unknown for a 
variety of reasons. Even when a changepoint time is known, one may still 
question whether the change instills a mean shift in series observations. This 
paper proposes an information based approach to the multiple changepoint 
identification (segmentation) problem. 

Our methods are specifically tailored to climatic time series in that they 
allow for periodicities and autocorrelations. Multiple changepoint detection 
procedures have been studied under the assumption that the series is driven 
by independent and identically distributed errors [Braun and Miiller (1998), 



Received December 2008; revised September 2009. 

Supported by NSF Grants DMS-07-07037 and DMS-09-05570 and Hong Kong Re- 
search Grant Council Grant CERG 401507. 

Key words and phrases. Changepoints, genetic algorithm, level shifts, minimum de- 
scription length, periodic autoregression, time series. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Statistics. 
2010, Vol. 4, No. 1, 299-319. This reprint differs from the original in pagination 
and typographic detail. 



1 



2 



Q. LU, R. LUND AND T. C. M. LEE 



Caussinus and Mestre (2004), Menne and Williams (2005)]. This is unrealis- 
tic in climate settings where observations display moderate to strong serial 
autocorrelation. Ignoring autocorrelations can drastically alter changepoint 
inferences, as positive autocorrelation can be easily mistaken for mean shifts 
[see Berkes et al. (2006) and Lund et al. (2007)]. Multiple changepoint meth- 
ods for time series data represent a very active area of current research 
[Davis, Lee, and Rodriguez- Yam (2006), Fearnhead (2006)]. Series recorded 
daily or monthly also display periodic dynamics. Our methods allow for sea- 
sonality by employing a time series regression model with periodic features. 
In short, this paper develops a multiple changepoint segmenter that applies 
to a variety of realistic climate series. 

The rest of this paper is organized as follows. Section 2 introduces the 
time series regression model that underlies our work. Section 3 develops an 
objective function for the model. The objective function is a penalized like- 
lihood whose penalty is based on the minimum description length (MDL) 
principle. This modifies Caussinus' and Mestre's (2004) model to allow for 
autocorrelation, seasonal effects, and also changes their likelihood penalty 
to an MDL-based penalty. Each segment of our model is allowed to have 
a distinct mean, but the autocovariance structure of each segment is con- 
strained to be the same. Section 4 presents a genetic-type algorithm capable 
of optimizing the objective function to obtain estimates of the changepoint 
numbers, locations, and the time series regression parameters. Section 5 
presents a short simulation study for feel. Section 6 applies the methods to 
a century of monthly temperatures from Tuscaloosa, Alabama and Section 
7 concludes with comments. 

2. Model description. The object under study is a time series {Xt} gov- 
erned by periodic errors and multiple level shifts. The period of the series is 
T and is assumed known. The series observation during season v, 1 < v < T, 
of the (n + l)st cycle is denoted by X n T+ u - The time-homogeneous and pe- 
riodic notation {X t } and {X n T+v} are used interchangeably, the latter to 
emphasize seasonality. We index the first data cycle with n = so that the 
first observation is indexed by unity. For simplicity, we take d complete cy- 
cles of observations; specifically, the observed data are ordered as X±, . . . , Xjy 
and d = N/T is assumed a natural number. 

The model driving our work is a simple linear regression in a periodic 
environment: 

(2.1) X nT+u =^u + a(nT + v) + 5 nT +v + £ n T+v 

In (2.1) a is a linear trend parameter that is assumed time homogeneous 
for simplicity; [i v is the season v location parameter (a detrended mean in 
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the absence of changepoints) . The errors {ej} have zero mean and are a 
periodically stationary series with period T in that 

(2.2) Co\(e t ,e s ) = Cov(e t+T ,e s+T ) 

for all integers t and s. Many climatic series have periodic second moments 
in the sense of (2.2). For a sample of size N with m < N changepoints, the 
ordered times of the changepoints are denoted by 1 < t\ < r 2 < • • • < r m < N. 
The number of changepoints and the changepoint times are considered un- 
known. There are m + 1 different segments (regimes) during the observation 
record. At each changepoint time, our model allows for a mean shift in the 
observations. Such a structure is described by 

(Ai, l<t<n, 
A 2 , n<t<T2, 

A m+ i, r m <t<JV+l. 

For parameter identifiability, we take Ai =0; otherwise, the Aj's and fi^'s 
would become confounded. For a fixed N, the mean component E[X n T+ v } i n 
(2.1) depends on the T + 1 + m parameters fix, . . . ,f4r, a, and A 2 , . . . , A m+ i. 
Generalizations of (2.1) are mentioned in Section 3 when we derive MDL 
codelengths. 

To describe the time series component {e n T+ u }, we use a causal periodic 
autoregression of order p [PAR(p)]. Such errors are the unique (in mean 
square) solution to the periodic linear difference equation 

v 

(2.3) 

fc=l 

Here, \Zt\ is zero mean periodic white noise with variance <r 2 (i / ) during 
season v. Solutions to (2.3) are indeed periodic with period T in the sense 
of (2.2). PAR models are dense in the set of short memory periodic time 
series and parsimoniously describe many such series; explicit expressions for 
many time series quantities are available for PARs. 

In many applications, reference series are available. A reference series is 
a series of the same genre as the series to be studied (the target series) that 
serves to aid changepoint identification. For example, with the Tuscaloosa 
temperatures examined later, series from nearby Greensboro AL, Selma AL, 
and Aberdeen MS are available over the same period of record. By con- 
structing a target minus reference difference series, mean shifts induced by 
changepoints are sometimes illuminated. When the reference series is highly 
positively correlated with the target series, the target minus reference se- 
ries will have smaller autocorrelations than the target series at all lags (this 
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happens when the target and reference series have the same periodic autoco- 
variance structure and the correlation between these two series exceeds 1/2 
at all times). Also, the linear trend assumption is typically more plausible 
for target minus reference differences than the target series, as long-memory 
and other nonlinear features can be eliminated in the subtraction. More- 
over, the seasonal mean cycle is frequently reduced or altogether eliminated 
in target minus reference series. Drawbacks with reference series lie with 
additional undocumented changepoints that the reference series may intro- 
duce. Algorithms aimed at resolving which series among target and multiple 
references is responsible for any found changepoints are now available [see 
Menne and Williams (2005, 2009)], but these works do not consider seasonal 
features or autocorrelated errors. 

Note that the difference of two series governed by (2.1) again lies in (2.1). 
Hence, in the next three sections we simply consider a single series satisfy- 
ing (2.1). Reference series will return in Section 6. 

The parameters in the model will become important later. The PAR(p) 
model, including the T white noise variance parameters, has (p + 1)T auto- 
covariance parameters. For a fixed to, there are also the changepoint times 
Ti, . . . ,r m and the mean shifts A2, • • • , A m +i. Finally, a trend component a 
and the seasonal means fj,i,...,fiT are present. Hence, given p and m, there 
are 2m + 1 + (p + 2)T model parameters. Given p and m, we will need to es- 
timate ti, . . . , T m , A2, • • • , A m+ i, hi, . . . , ht, a, and all PAR(p) parameters. 
Developing and optimizing an objective function for this purpose will be the 
subject of our next two sections. 

Before leaving the model description, we make a comment. The model 
studied here allows for process changes at the changepoint times in the form 
of level shifts. This is reasonable in climate cases [Vincent (1998), Menne and 
Williams (2005), Lund et al. (2007)]. In other applications such as speech 
recognition and finance, it may be more realistic to keep mean process levels 
fixed and allow the time series parameters to change at each changepoint 
time [see Inclan and Tiao (1994), Chen and Gupta (1997), Davis, Lee, and 
Rodriguez- Yam (2006)]. 

3. An MDL objective function. To fit the above model, estimates of the 
changepoint numbers and locations, as well as the model parameters, are 
needed. Since different changepoint numbers refer to models with a different 
number of parameters, the model dimension will also need to be estimated. 
This is a model selection problem. Popular approaches to model selection 
problems include AIC (Akaike Information Criterion), BIC (Bayesian Infor- 
mation Criterion), cross-validation type methods, and MDL methods. For 
problems that involve the detection of regime changes, MDL methods often 
provide superior empirical results [e.g., Lee (2000, 2002), Davis, Lee, and 
Rodriguez- Yam (2006)]. This superiority is likely due to the fact that both 
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AIC and BIC place the same penalty on all parameters, regardless of the 
nature of the parameter (e.g., mean shift magnitudes and changepoint times 
receive the same penalty). On the other hand, MDL methods can situation- 
ally tailor penalties for parameters of different natures, thereby accounting 
for whether the parameter is real or integer-valued. 

The MDL principle was developed by Rissanen (1989, 2007) as a general 
method for solving model selection problems. It has roots in coding and 
information theories. In brief, MDL defines the best fitting model as the 
one that enables the best compression of the data; for the current problem, 
the data are the observed {A'j}. There exist several versions of MDL; the 
so-called two-part MDL is used here. For introductory MDL material, see 
Hansen and Yu (2001) and Lee (2001). 

The rest of this section develops a two-part MDL objective function for 
fitting a good model. The main idea behind the two-part MDL is described 
as follows. First, the data {Xt} is decomposed into two parts, the fitted 
candidate model and its corresponding residuals. MDL methods then cal- 
culate the total codelength (i.e., the amount of computer memory) required 
for storing both parts as a sum of the codelength of the two parts. Finally, 
MDL methods define the best fitting model as one that produces a min- 
imal codelength. Intuition behind MDL methods lies with why minimum 
codelength models are also good statistical models. Essentially, it is that 
both good compression and good statistical models are capable of capturing 
regularities in the data. 

To proceed, let CL(z) denote the codelength of the object z. Also write 
a candidate fitted model as Ai and its residuals as {it}- The codelength is 
additive in that 

(3.1) CL({X t }) = CL(M) + CL({i t }). 

The term CL(.M) in (3.1) can be viewed as a model complexity term, while 
Ch({it}) can be viewed as a data fidelity term. Our next task is to obtain a 
computable expression for CL({Xt}) that can be minimized. We begin with 
the calculation of CL(A^). 

An important result of Rissanen (1989) is that the maximum likelihood 
estimate of a real-valued parameter computed from a series of iV observa- 
tions (N is large) can be effectively encoded with log 2 (iV)/2 bits. The trend 
parameter a hence requires log 2 (iV)/2 bits to encode. The seasonal mean 
parameters are effectively estimated via seasonal sample means, each 
of which contributes log 2 (d)/2 bits to the codelength. Given values of the 
changepoint times t±, . . . ,r m , the mean shift parameter Aj can be estimated 
with data from the jth segment only. Hence, Aj requires log 2 (r ? ' — Tj_i)/2 
bits to encode for 2 < j < m + 1 (T m +i = N + 1 is taken as a convention). 
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Hence, the portion of the codelength from mean parameters in the time 
series regression (i.e., a,{fi u }^ =1 , and {Aj}™^ 1 ) is 

^ io g2 (AQ ri og2 (d) i^ 1 , 

(3-2) + + 2 2^ log 2( r i ~ r i-i)- 

3=2 

The PAR(p) time series parameters [(f)k{v) for l</c<p;l<^<T and 
cr 2 (v) for 1 < f < T] are also real valued. Because {e^} is a zero mean pro- 
cess, we need only consider the zero mean version of this model. In this case, 
the PAR parameters can be estimated in an efficient manner via seasonal 
versions of the Yule- Walker equations [see Pagano (1978)]. The necessary 
equations for this task are presented in Shao and Lund (2004). Yule- Walker 
PAR parameter estimators are asymptotically most efficient [Pagano (1978)]; 
in fact, these estimators are the likelihood estimators except for the edge- 
effects (i.e., the likelihood is conditional on the first p observations). The 
Yule-Walker estimators can be computed from the sample autocovariances 
Ju(h) over the lags h = 0, . . . ,p. The lag h sample autocovariance at sea- 
son v is defined as %(h) = En=o £ nr+v e nT+i/-/ii where St is taken as 
zero should a t < be encountered in the summation. Observe that 7^(0) is 
a function of d series observations for each fixed v. Moreover, ^/ u (h) is es- 
sentially computed from 2d observations. Hence, the total codelength from 
PAR(p) parameters is 

riog 2 (ri) P T log 2 (2d) 
[6 - 6) 2 + 2 

The parameters n, . . . ,r m are integers and must be treated as such. Ar- 
guing as in Davis, Lee, and Rodriguez- Yam (2006), an integer parameter 
bounded by Q takes log 2 (Q) bits to encode. Since the Tj's are ordered, we 
have Tj < Tj+i. This differs from Davis, Lee, and Rodriguez- Yam (2006) 
in that we do not loosely bound tj — Tj-% by iV for each j. In short, the 
codelength induced by the changepoint times that we use is 

m 

(3.4) 5>g 2 (T 3 -)+log 2 (iV). 

Finally, the model orders p and m contribute 

(3.5) log 2 (p) + log 2 (m) 

bits to the codelength. While m is bounded by N, typical values of m are 
significantly smaller than and a penalty of log 2 (A) would be too much 
for m changepoints. 
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q I ™ + l 

CL(M) = -log 2 (iV) + Tlog 2 (d) + log^-r^) 

i=2 

(3-6) 

+ pTl ° g ^ 2d) +5^1og 2 (r i ) + log 2 (ro) + log 2 (p). 

Moving to CL({ft}), a fundamental result of Rissanen (1989) is that this 
quantity equals the negative logarithm (base 2) of the likelihood of the 
fitted model Ai. For the present problem, this conditional likelihood can 
be calculated as follows. A Gaussian joint density of observations from the 
model, denoted by L, takes the classical innovations form modified to allow 
for series periodicities and level shifts at the changepoint times: 



N 



-1/2 



(3.7) 



L = (2tt) 



-N/2, 



vt=l / 



exp 



l^(X,-X,) 2 



- £ 

t=l 



Here, Xt = P(Xt \X\, . . . , Xt-i, 1) is the best one-step-ahead predictor of 
Xt from linear combinations of a constant and X%, . . . ,Xt-±. Also, vt = 
E[(Xt — Xt) 2 ] is the mean squared error (unconditional) of the one-step- 
ahead predictor. 

The one-step-ahead prediction equations and mean squared errors for the 
PAR(p) setup are easily expressed: 

v 

X nT+v = E[X nT+u ) + ^2<pk(v){X nT+ u-k ~ E[X nT+u _ k ]), nT + v>p, 
fc=i 

where E[X n T+ u ] = (J>v + ot(nT + v) + 8 n T+v is the mean function. Computing 
Xt and vt for t <p is done as in Shao and Lund (2004). Taking a negative 
logarithm in (3.7) gives 



AT 1 N " (Y — X V 

(3.8) CL({£ t }) = - log 2 (2^) + - Y, l°g 2 (^) + ^ log 2 (c) E v • 

i=l t=l 1 

Substituting (3.6) and (3.8) into (3.1), we arrive at the following approx- 
imation: 



N 



CL({X t }) = log 2 (e) 



m+l 



|ln(iV) + Tln(d) + 1 E Mri-T H ) + 



pTln(2d) 



3=2 



Info) + ln(m) + ln(p) + — ln(2vr) 

J=2 
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t=i t=i 1 

Because N, d, and T are constant, our objective function for the model M, 
denoted by MDL(A4), can be taken as 



MDL(^) = \ ^ ln(r, - r^) + + ^Infa) 

3=2 3=2 

+ ln(m) + ln(p) + ln(t*) + 1 £ (Xt ~ Xf)2 . 



2^ y ' 2^ v t 

t=\ t=i t 

MDLs for variants of the model in (2.1) are worth mentioning. Should one 
also allow the trend to change with each regime, the codelength becomes, 
after appropriate modification of (3.2), 

MDL(A4) = £ ln( Tj - t^-i) - ln(n - l)/2 + + f>( Tj ) 

j=i 3=2 

+ ln(m) + In(p) + i £ + i £ C**Z*i)! 

t=i t=i * 

where to = 1 is taken as a convention. If the seasonal location parameters 
\l v are consolidated to a single fj,, then an appropriate MDL (this assumes 
a single trend parameter) is 

MDL(.M) = \ |>(Ti " + + f>(Ti) 

3=1 3=2 

3.10 

JV , iV 



+ ln(m) + ln(p) + ^ ln(t*) + i £ C*L^ 



2^ v y 2^ u t 

t=i t=i 1 

which is (3.9) expect for the 2 _1 ln(ri — To) term added in the first summa- 
tion. MDLs for models where the structural form of the regression changes 
segment by segment are harder to quantify, but also have climate ramifica- 
tions and are currently being investigated. 

By an MDL model, we refer to a model A4 that minimizes a MDL 
score over the class of models being considered. Practical minimization of 
MDL(A^) over all admissible models is not a trivial task, which brings us 
to our next section. 
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4. Optimizing the objective function. First, suppose that we know p and 
m and the changepoint times ri,...,r m . Then computation of MDL(A / 1) 
proceeds as follows. Computation of the model codelength given the param- 
eters is straightforward. For computation of the likelihood contribution to 
the codelength, write (2.1) in the general linear models form 

(4.1) X = Df3 + e. 

In (4.1), = (^i,...,/i T ,a,A 2 ,...,A m+1 )', X = (X lt . . . , X N )', e= (si,..., 
£jv)', and D is the N x (T + 1 + m) design matrix 

D=[S\C\R], 

where S is an N x T dimensional seasonal indicator matrix (all entries are 
zero except St tU = 1 if t = £T + v for some t £ {0, . . . , d — 1}), C is an N x 1 
vector with Ct = t, and R is an N x m dimensional matrix with all zero 
entries except Rtj = 1 when time t, 1 < t < N, is observed during regime j 
for 2 < j < m + I. 

We first estimate f3 with ordinary least squares methods. From the esti- 
mated /3, residuals of this model fit are next computed. From these resid- 
uals and a PAR order parameter p, estimates of <f>k{y) for 1 < v < T and 
I <k <p and cr 2 (v) for 1 < v < T are constructed via seasonal Yule- Walker 
moment estimation methods. With estimates of the ^(i^'s and cr 2 (z>')'s, 
one can return to (4.1) and compute generalized weighted least squares es- 
timators of /3. New residuals are computed and the process is iterated in 
a Cochrane-Orcutt fashion [see Cochrane and Orcutt (1949)] until conver- 
gence is achieved. The process gives jointly optimal estimators of (3 and the 
(bk(vYs and Typically only several iterations are needed. 

The above enables us to quickly compute a codelength for fixed values of 
p, m, and ri,...,r m . However, not counting different values of p, there are 
2^ different configurations of m and T\,...,T m that must be considered. In 
other words, the parameter space has a huge cardinality. To optimize the 
codelength over this parameter space, we now introduce a genetic algorithm. 

A genetic algorithm (GA) is a stochastic search that can be applied to 
a variety of combinatorial optimization problems [Goldberg (1989), Davis 
(1991), Reeves (1993)]. The basic principles of GAs were first developed by 
Holland (1975) and are designed to mimic the genetic process of natural 
selection and evolution. GAs start with an initial population of individuals, 
each representing a possible solution to the given problem. Each individ- 
ual or chromosome in the population is evaluated to determine how well it 
scores with respect to the objective function. Highly fit individuals are more 
likely to be selected as parents for reproduction. In a crossover procedure, 
the offspring {children) share some characteristics of the parents. Mutation 
is often applied after crossover to introduce random changes to the current 
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population with a small probability. Mutation increases population diversity. 
The offspring are used to construct a new generation by either a generational 
approach (replacing the whole population) or a steady-state approach (re- 
placing a few of the less fit individuals). This process is repeated until an 
individual is found that roughly optimizes the objective function. 
The GA used in this study is described as follows. 

Chromosome representation: The first step in designing a GA is to create 
a suitable chromosome representation for the problem. Here, any individual 
(model) can be described as a set of parameters: the number of changepoints 
m, the order of the PAR model p, and the changepoint locations t%, . . . , r m . 
Once these parameters are fixed, the regression parameters in the model 
(2.1) can be estimated using the methods described above. Hence, the chro- 
mosome, denoted by u = (m,p, ti, . . . , r m ), is an integer vector of length 
m + 2. The lengths of the chromosomes in the population depend on the 
number of changepoints. A minimum number of observations in each regime 
is set to m{T to ensure that reasonable mean shift estimates are obtained in 
all segments. Here, run is the minimum number of cycles between adjacent 
changepoints. Our work will take m£ = 1 (no changepoints within a year for 
monthly data). Also, we impose the upper bound p max for the order of the 
PAR(p) model; Pmax — 3 is used in the forthcoming simulation study and 
examples. 

Initial population generation: For each individual, the PAR order p is 
first randomly selected with equal probabilities from the set {0, 1, 2, 3}. The 
changepoint numbers m and locations are then independently simulated as 
follows. There is a probability pb, essentially representing the probability 
that any admissible time is selected as a changepoint. Since there can be no 
changepoint before time t = 1 + m{I ', we first examine time t = 1 + m{I ', 
flipping a coin with heads probability pb- If the flip is heads, t = l + m{F 
is declared to be the first changepoint (t\ = 1 + miT) and attention shifts 
to the next possible changepoint time, which is time 1 + 2m (I . But if the 
flip is tails, t = 1 + m{F is not chosen as a changepoint and we move to the 
next location at t = 1 -\- m(F + 1, independently flipping the coin again. The 
process is continued in a similar manner until the last admissible changepoint 
time at t = N — m{F is exceeded. The population size n p = 30 is used in this 
study and pb is set to be 0.06d (six changepoints over a century). 

Crossover: Pairs of parent chromosomes, representing mother and father, 
are randomly selected from the initial population or current population by a 
linear ranking/selection method. That is, a selection probability is assigned 
to an individual that is proportional to the individual's rank in optimizing 
the objective function. The least fit individual is assigned the rank and 
the most fit individual is assigned the rank n p — l. A crossover procedure, as 
explained in the paragraph below, is then applied to the parents to produce 
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offspring for the next generation. The probability that any two parents have 
children, denoted by p c , is set to p c = 1 — me/d. 

In our GA implementation, only one pair of parent chromosomes is chosen 
from the current generation and one child is produced by "mixing" two par- 
ent chromosomes with a uniform crossover. This works as follows. The child's 
PAR order p is either the mother's or the father's PAR order, with both be- 
ing equally likely. The child's changepoint locations are randomly selected 
using all admissible changepoint locations from both mother and father. For 
example, for N = 1200, T = 12, and mi = 1, suppose the mother's chro- 
mosome has 3 changepoints at the times t = 200, 320, 600 and the father's 
chromosome has 4 changepoints at the times t = 205,300, 710,850. First, all 
changepoints from mother and father are mixed together and sorted from 
smallest to largest, yielding the string (200,205,300,320,600,710,850). We 
select the first changepoint of the child at t = 200 with probability 0.5. If 
t = 200 is selected as a changepoint, then we discard the changepoint t = 205 
(it would violate segmentation spacing requirements) and move to the next 
candidate changepoint at t = 300, again doing a fifty-fifty selection/inclusion 
randomization. If t = 200 is not chosen as one of the child's changepoints, we 
move to the next changepoint at t = 205 with the same fifty-fifty selection 
criterion. The child's m is simply the number of retained changepoints. 

Mutation: Mutation is applied to the child after crossover with a constant 
probability p m . The probability p m is typically low; we use p m = 0.05 in the 
following examples. The PAR order p for the new chromosome produced by 
mutation is equal to the child's p with a probability of 0.5. Then change- 
point locations can either take on the corresponding changepoints from the 
child's or be a new set randomly selected from the parameter space. Muta- 
tion ensures that no solution in the admissible parameter space has a zero 
probability of being examined. 

New generation: The steady-state replacement method with a duplication 
check as suggested by Davis (1991) is applied here to form a new generation. 
One advantage of the steady-state approach over the generational approach 
is that it typically finds better solutions faster. In our implementation of the 
steady-state approach, only one individual is replaced in the current gener- 
ation by a child after crossover and/or mutation. This allows parents and 
offsprings to live concurrently, which is true for long-lived species [Beasley, 
Bull, and Martin (1993)]. If the child is already present in the current gen- 
eration, this child will be discarded and another child must be produced by 
the selection-crossover-mutation process. The duplication check is applied 
to all new children until a child is found that is not present in the current 
generation. In this way, duplicate solutions and premature convergence are 
significantly avoided. 

Migration: Migrations act to speed up convergence of the GA and can be 
implemented via a parallel scheme [Davis (1991), Alba and Troya (1999)]. 
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Migration also reduces the probability of premature convergence. The pop- 
ulation is divided into several different sub-populations (islands). Highly fit 
individuals periodically migrate between the islands. The island model GA 
is controlled by several parameters, such as the number of islands Nj, the 
frequency of migration Mi, the number of migrants M n , and the method 
used to select which individuals migrate. The migration policy used here is 
as follow. After every Mj generations, the least fit individual on island j, 
j = 1, . . . ,Nj, is replaced by the best individual on island i, which is ran- 
domly selected among all other islands (j ^ i). Therefore, each island sends 
and receives individuals from different islands throughout the duration of 
the search process. Here, we set Nj = 40, Mi = 5, and M n = 1. 

Convergence and stopping criteria: We follow the criterion of Davis, Lee, 
and Rodriguez- Yam (2006) to declare convergence and terminate the GA. 
If the overall best individual at the end of each migration does not change 
for M c consecutive migrations, then the GA is deemed to have converged to 
this best individual. Additionally, if the total number of migrations exceeds 
a predetermined maximum number M* , then the search process is termi- 
nated and the best individual in the M*th migration is taken as the optimal 
solution to the given problem. The parameters M c and M* are taken as 10 
and 25 in the study, respectively. 

5. A simulation study. This section investigates the accuracy of the 
above methods via simulation. This study is designed to correspond to the 
simulation study in Caussinus and Mestre (2004). Elaborating, we will sim- 
ulate a thousand series and apply our methods to each series. Each series 
contains a century (d = 100) of monthly data (T = 12) with six (m = 6) 
changepoints. This corresponds to the average number of changepoints over 
a century of operation reported in Mitchell (1953). The changepoint mean 
shifts in every series occur at the times t\ = 240, T2 = 480, t% = 600, T4 = 840, 
T5 = 900, and tq = 1020. The error terms {et} are simulated as a Gaussian 
first order periodic autoregression (p = 1) with parameters 4>\(v) and o 2 {y) 
as specified in Table 1; the seasonal means \i v are also listed in Table 1 
and are in degrees Celsius. These values are those that were estimated for 
50 years of monthly temperatures from Longmire, Washington, which was 
studied in Lund et al. (2007). The trend parameter a was set to zero in all 
simulations. 

The magnitude of the mean shifts A2, • • • , A7 are critical. Big mean shifts 
make changepoints easier to detect. To facilitate interpretability, we use a 
common mean shift magnitude A > at all changepoint times. For instance, 
if the current regime has mean level c (trend and seasonal effects are assumed 
zero here), the next regime will have mean c + A or c — A, with a fifty-fifty 
chance of shifting up or down at each changepoint time. It follows that 
A = I Aj - Aj_i I for j = 2, . . . , 7. 
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Table 1 
Simulation parameters 



V 






<r» 


1 


-0.61 


0.272 


2.713 


2 


0.99 


0.284 


2.748 


3 


2.35 


0.478 


1.871 


4 


4.91 


0.286 


1.717 


5 


8.74 


0.335 


2.474 


6 


12.15 


0.279 


2.403 


7 


15.51 


0.245 


2.569 


8 


15.47 


0.137 


1.910 


9 


12.79 


-0.127 


2.826 


10 


7.82 


0.082 


2.488 


11 


2.32 


0.196 


2.394 


12 


-0.25 


0.214 


2.256 



The ability of our model to detect mean shifts can be roughly quantified 
by the mean shift magnitude relative to the process standard deviation (the 
latter averaged over a complete seasonal cycle). A parameter quantifying 
such aspects, denoted by k, is 

A 



£nT+u) 

Better quantifiers of changepoint detection power may well exist, but deriva- 
tion of such quantities would be difficult and is tangential to our points. 
Below, we consider three different k values: 1.0, 1.5, and 2.0. The larger k 
is, the easier it is to detect changepoints. A realization of a temperature 
series with k = 1.5 is plotted in Figure 1 for feel. 

Table 2 and Figure 2 summarize the results of the simulations. Table 2 
reports empirical frequency distributions of the number of estimated change- 
points. Observe that the true value of six changepoints is obtained more fre- 
quently as k increases. When k = 2.0, the percentage of simulations where 
the correct number of changepoints is estimated is 58.9%, which is bet- 
ter than the corresponding 43.4% reported in Caussinus and Mestre (2004) 
that applies to uncorrelated and time- homogeneous settings (i.e., 100 years 
of annual data). In fairness, we note that the equivalent sample size of our 
simulated series (the number of independent data points with the same pe- 
riodic variances) translates to more than the 100 independent data points of 
Caussinus and Mestre (2004) (we will not quantify equivalent sample sizes 
further here). The correct number of changepoints is identified only 0.9% 
when k = 1.0 [this, however, is also slightly better than the corresponding 
result in Caussinus and Mestre (2004)]. It is clear that changepoint numbers 



14 Q. LU, R. LUND AND T. C. M. LEE 




120 240 360 480 600 720 840 960 1080 1200 

Time of Observation 




u} 120 240 360 480 600 720 840 960 1080 1200 

Time of Observation 

Fig. 1. A simulated series with six changepcnnts. 

are underestimated in settings with relatively small k. In fact, the empirical 
mean (standard deviation in parentheses) of the distributions in Table 2 are 
2.74 (1.07) for k = 1.0, 4.34 (1.48) for k = 1.5, and 5.413 (1.20) for k = 2.0. 



Table 2 

Estimated changepoint numbers and 
PAR(l) order when m = 6 



m 


K = 1.0 


K = 1.5 


k = 2.0 





0.1% 


0.0% 


0.0% 


1 


12.2% 


2.6% 


0.0% 


2 


30.1% 


10.9% 


3.7% 


3 


34.5% 


17.5% 


5.4% 


4 


18.3% 


23.6% 


11.8% 


5 


3.9% 


12.9% 


12.5% 


6 


0.9% 


31.9% 


58.9% 


7 


0.0% 


0.6% 


7.1% 


>7 


0.0% 


0.0% 


0.6% 


p=l 


99.9% 


100% 


100% 


p = 


0.1% 


0.0% 


0.0% 
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Fig. 2. Histograms of estimated changepoint times. 



Overall, one sees that changepoint shift sizes are critical in changepoint de- 
tection, that the detection situation is difficult when k is small, but that 
methods work reasonably well when k is relatively large. Using monthly 
data (as opposed to annual averages) also seems to improve changepoint 
detection power. 

As for where the changepoints are estimated to occur, Figure 2 shows his- 
tograms of the estimated changepoint locations, reporting the total number 
of times a changepoint is signaled at time t for 1 < t < N in the 1000 simu- 
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lations. Observe that the histograms have modes around the actual change- 
point times. It is also evident that the changepoints at times 840 and 900 
were the most difficult to detect, a feature attributed to the close proximity 
of the times of these two changepoints (with the fifty-fifty up/down mean 
shift randomization employed, the sign of these two mean shifts differ with 
probability 1/2, in which case their detection is relatively more difficult). 

Note that the correct autoregressive order p = 1 was obtained virtually 
all of the time. Hence, the time series model selection component seems to 
be working well. As changing the trend parameter did not appreciably affect 
results, we will not report separate tables with nonzero trends. 

We now compare the MDL penalty more closely with the Caussinus- 
Lyazrhi penalty used in Caussinus and Mestre (2004). The Caussinus-Lyazrhi 
penalty is larger than AIC or BIC penalties, but does not penalize parame- 
ters in the mean function or consider autocorrelation aspects. To make this 
comparison, 1000 series of length 100 were simulated with six changepoints 
always occurring at the times 20, 40, 50, 70, 75, and 85. The mean shift 
size parameter k was changed to the parameter a in Caussinus and Mestre 
(2004) to mimic their simulations. The errors in the model were assumed to 
be Gaussian and independent. Note that the level of changepoint activity 
relative to the sample size has increased 12-fold from the previous simula- 
tions. Table 3 below lists estimates of the relative frequencies of changepoints 
found by the genetic algorithm with an MDL penalty when \i v is held con- 
stant with v. No trends were considered in this setup nor was tuning of the 
genetic algorithm (varying its mutation probabilities, etc.) considered in de- 
tail. The frequency distributions in Table 3 are approximately the same as 
those in Caussinus and Mestre (2004), perhaps slightly worse, in all cases. 
For this sample size and level of changepoint activity, an MDL penalty seems 
to perform about the same as the Caussinus-Lyazrhi penalty. Of course, we 
reiterate that some gains are made by considering monthly data in lieu of 
annual averages. 



Table 3 

Estimated number of changepoints for n — 100 



rn 


a = 1.0 


a = 2.0 


a = 3.0 





6.2% 


0.1% 


0.0% 


1 


29.5% 


0.3% 


0.0% 


2 


32.5% 


4.2% 


0.0% 


3 


22.8% 


5.4% 


0.0% 


4 


7.4% 


35.3% 


7.7% 


5 


1.5% 


33.0% 


4.4% 


6 


0.1% 


20.9% 


81.7% 


7 


0.0% 


0.8% 


6.1% 


>7 


0.0% 


0.0% 


0.1% 
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6. The Tuscaloosa data. Figure 3 plots a century of monthly data from 
Tuscaloosa, Alabama recorded from January, 1901-December, 2000. A sea- 
sonal mean cycle is visually evident in the data, but trends and mean 
shifts are not readily apparent. Comparing the year-to-year jaggedness of 
the seasonal throughs (the winter minimums) against the year-to-year sea- 
sonal peaks (July maximums), it is discerned that this series has a periodic 
variance with winter temperatures being much more variable than summer 
temperatures. In fact, as we will see, the entire autocorrelation structure of 
the series is periodic. 

The Tuscaloosa series is one in which the station history is reasonably 
documented. In particular, a catalog (called meta-data) exists that notes 
the circumstances under which the data were recorded, including the times 
of station relocations and instrumentation changes. This said, meta-data 
files are notoriously incomplete [Menne and Williams (2005)] and "undocu- 
mented" changepoints may lurk. The Tuscaloosa series also has a moderately 
clean changepoint record with only four major documented changes over a 
century of operation; as noted before, the average United States temperature 
station experiences about six changepoints per century [Mitchell (1953)]. In 
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Fig. 3. The Tuscaloosa data with changepoint structure imposed. 
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short, the Tuscaloosa is a good "proving ground" series for changepoint 
methods. 

Level shifts in temperature series are arguably the most important factor 
in assessing temperature trends [see Lu and Lund (2007)]. It has been argued 
in climate settings that the manner in which changepoints are handled may 
be the most critical factor in the global warming debate. Supporting this, 
temperature insurance treaties on Wall Street are based solely on station 
location and gauge properties, while ignoring long-term trends altogether. 

Our methods were applied to the Tuscaloosa data. A reference series was 
constructed by averaging three neighboring series located at Selma, AL, 
Greensboro, AL, and Aberdeen, MS over the century of record. We work 
with one reference series that averages three neighboring series to expedite 
the discourse; methods that analyze all L) pairs of stations are discussed in 
Menne and Williams (2005, 2009). 

First, we examine the Tuscaloosa series without a reference. The fitted 
MDL model has two changepoints at times 460 (April, 1939) and 679 (July, 
1957). The mean function induced by these two changepoints, less the sea- 
sonal cycle but including the trend, is plotted against the data in Figure 3. 
The mean shift magnitudes of the 1939 and 1957 changepoints, in degrees 
Celsius, are both negative: A 2 = -0.94 ± 0.20 and A 3 = -2.33 ± 0.30. The 
estimated trend parameter is a = 0.00258 ± 0.00039. The standard errors 
were estimated from the fitted time series regression model with generalized 
weighted least squares techniques. The estimated order of the PAR model 
is p = 1 ; a consequence of this is that the autocovariance structure in the 
errors of the fitted models is indeed periodic. 

Second, we examine the Tuscaloosa minus the reference series. This sea- 
sonally adjusted difference series is plotted in Figure 4. In this target minus 
reference, four changepoints are flagged: March 1909, December 1919, July 
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Fig. 4. The Tuscaloosa minus the reference data with changepoint structure imposed. 
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1933, and August 1990. The estimates of the Aj's are A 2 = 0.76 ± 0.11, 
A 3 = 0.26 ± 0.11, A 4 = 0.77 ± 0.13, and A 5 = 1.47 ± 0.19. Note that a mean 
shift with the small magnitude of 0.26 has been flagged. The trend estimate 
is a = —0.00066 ± 0.00015 and the selected order of the autoregression is 
p = 1. Observe that the fitted order of the autoregression did not reduce 
from that for the raw series; that is, periodic autocorrelation still exists 
in the target minus reference series. Also, the trend for the target minus 
reference series appears to be significantly negative. We comment that the 
two large negative values occurring in the 1940s and the 1950s appear to 
be decimal typos in the raw data; that is, the monthly average tempera- 
ture for Tuscaloosa was entered as ten degrees too small. We make this 
claim after examining additional reference series from various cities close to 
Tuscaloosa. Whereas the series in this database have been quality checked 
to some degree, errors like this may still exist. We reran the analysis above 
after replacing these two values by (1) their estimated seasonal means fi u 
and (2) what we believe are the correct values, that is, adding 10 degrees 
to both outliers. In both cases, four changepoints with similar magnitudes 
and times to the ones above are found. The 1957 changepoint flagged in the 
target series has not been flagged in any version of the target minus differ- 
ence series. The 1909 changepoint is possibly attributed to a changepoint 
in the reference series: Greensboro reports a time of observation change in 
1906 and Aberdeen reports a station relocation in 1915. 

The meta-data show four changepoints in this series: the first was a sta- 
tion relocation in November of 1921, the second was a station relocation in 
March of 1939, the third was a station relocation during June of 1956 and 
an accompanying instrumentation change in November of 1956 (we regard 
this as one changepoint), and the fourth is a station relocation and instru- 
mentation change in May of 1987. The reference series analysis seems to 
have correctly identified three of these four changepoints (we are liberally 
including the 1933 flagged changepoint time as correctly identifying the 1939 
changepoint), missing the 1956 changepoint and adding a 1909 changepoint. 
The raw target series analysis misses the 1921 and 1987 changepoints, but 
finds the 1956 changepoint; also, the estimated time of the 1939 changepoint 
is much closer to its true value than that for the reference analysis. Over- 
all, it seems that the reference analysis is superior to simple target series 
analysis, but that one can learn something with both analyses. 

We caution the reader that trends in some monthly temperature series, 
especially when the series is aggregated over a large geographic region, may 
not be well described by a linear regression component. As noted by Hand- 
cock and Wallis (1994), trends at localized series are more likely to be ade- 
quately described with a simple linear structure. As a final diagnostic check, 
residuals from the model fits were computed. Figure 5 shows the sample au- 
tocorrelation of the residuals for the target series over the first 60 lags. The 
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Fig. 5. Sample autocorrelations of target series residuals. 

dashed lines are 95% pointwise confidence bounds for white noise. As only 
three of the sample autocorrelations lie outside the bounds (and then only 
slightly so), the model appears to have fitted the data well. Figure 6 shows 
the periodogram of the residuals from the target minus reference series. A 
long memory structure is not readily evident in this plot. 

7. Comments. Time series parsimony may be an issue with periodic 
data. Specifically, the penalty in (3.3) essentially assumes that the PAR(l) 
model requires (j>+ 1)T distinct parameters. In practice, changes in climate 
processes from season to season are slow/smooth. Low order Fourier series 
expansions, such as those in Lund, Shao, and Basawa (2005), can statis- 
tically simplify the model and serve to lessen the penalty for time series 
components. This issue is likely to be paramount should daily data be con- 
sidered. 
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